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Abstract 

Background: Detecting epistatic interactions plays a significant role in improving pathogenesis, prevention, 
diagnosis and treatment of complex human diseases. A recent study in automatic detection of epistatic 
interactions shows that Markov Blanket-based methods are capable of finding genetic variants strongly associated 
with common diseases and reducing false positives when the number of instances is large. Unfortunately, a typical 
dataset from genome-wide association studies consists of very limited number of examples, where current 
methods including Markov Blanket-based method may perform poorly. 

Results: To address small sample problems, we propose a Bayesian network-based approach (bNEAT) to detect 
epistatic interactions. The proposed method also employs a Branch-and-Bound technique for learning. We apply 
the proposed method to simulated datasets based on four disease models and a real dataset. Experimental results 
show that our method outperforms Markov Blanket-based methods and other commonly-used methods, especially 
when the number of samples is small. 

Conclusions: Our results show bNEAT can obtain a strong power regardless of the number of samples and is 
especially suitable for detecting epistatic interactions with slight or no marginal effects. The merits of the proposed 
approach lie in two aspects: a suitable score for Bayesian network structure learning that can reflect higher-order 
epistatic interactions and a heuristic Bayesian network structure learning method. 



Background 

Genome-wide association study (GWAS) focuses on stu- 
dies of the genetic variants related with a variety of dis- 
eases from individual to individual among a cohort of 
cases (people with the disease) and controls (similar 
people without the disease) [1-3]. The most important 
category of genetic variations is SNP (Single Nucleotide 
Polymorphism), which influences disease risk. Conven- 
tional analysis methods for GWAS data only consider 
one SNP at a time by the Armitage trend test (ATT) 
and are likely to miss genetic variants having slight to 
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moderate marginal effects but strong joint effects on 
disease risk. Moreover, it is widely acknowledged that 
some common complex diseases such as various types 
of cancers, cardiovascular disease, and diabetes are 
caused by multiple genetic variants [4]. Therefore, there 
is an urgent need to detect high-order epistasis (gene- 
gene interaction), which refers to the interactive effect 
of two or more genetic variants on complex human dis- 
eases, and explore how these epistatic interactions con- 
fer susceptibility to complex diseases [5]. However, the 
very large number of SNPs checked in a typical GWAS 
(more than 10 million) and the enormous number of 
possible SNP combinations make detecting high-order 
epistatic interactions from GWAS data statistically and 
computationally challenging [6,7]. 
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During the past decade, some heuristic computational 
methods have been proposed to detect causal interacting 
genes or SNPs. One type of computational methods for 
epistatic interactions detection are statistical methods 
including multifactor dimensionality reduction (MDR) 
[8-11], penalized logistic regression (stepPLR [12], las- 
soPLR [13]), and Bayesian epistasis association mapping 
(BEAM) methods [14]. MDR is a non-parametric and 
model-free method based on constructing a risk table 
for every SNP combination [11]. If the case and control 
ratio in a cell of this risk table is larger than 1, MDR 
will label it as "high risk", otherwise, "low risk". By the 
risk table, MDR can predict disease risk and will select 
the SNP combination with the highest prediction accu- 
racy. StepPLR and lassoPLR make some modifications 
to avoid the overfitting problem of standard logistic 
regression when detecting epistatic interactions [15]. For 
example, stepPLR combines the LR criterion with a 
penalization of the L2-norm of the coefficients. This 
modification makes stepPLR more robust to high-order 
epistatic interactions [12]. In general, most statistical 
methods can only be applied to small-scale analysis (i.e., 
a small set of SNPs) due to their computational com- 
plexity. Moreover, MDR, stepPLR and lassoPLR are all 
predictor-based methods, which make them easy to 
include false positives. Comparing to MDR, stepPLR 
and lassoPLR, BEAM is a scalable and non-predictor- 
based statistical method [14]. BEAM partitions SNPs 
into three groups: group 0 is for normal SNPs, group 1 
contains disease SNPs affecting disease risk indepen- 
dently, and group 2 contains disease SNPs that jointly 
contribute to the disease risk (interactions). Give a fixed 
partition, BEAM can get the posterior probability of this 
partition from SNP data based on Bayes theory. A Mar- 
kov Chain Monte Carlo method is used to reach the 
optimal SNP partition with maximum posterior prob- 
ability in BEAM. One drawback of BEAM is that identi- 
fying both single disease SNP and SNP combinations 
simultaneously make BEAM over-complex and weakens 
its power. 

An alternative approach is machine learning based 
methods, which are based on binary classification (pre- 
diction) and treat cases as positives and controls as 
negatives in SNP data. Support vector machine-based 
approaches [16] and random forest-based approaches 
[17] are two commonly-used machine learning methods 
for epistatic interactions detection. They use SVM or 
random forest as a predictor and select a set of SNPs 
with the highest prediction/classification accuracy by 
feature selection. Like predictor-based statistical meth- 
ods, machine learning-based methods lack the capability 
of detecting causal elements and tend to introduce 
many false positives, which may result in a huge cost for 
further biological validation experiments [18]. 



Recently, we propose a new Markov Blanket-based 
method, DASSO-MB, to detect epistatic interactions in 
case-control studies [18]. The Markov Blanket is a mini- 
mal set of variables, which can completely shield the tar- 
get variable from all other variables based on Markov 
condition property. Thus, DASSO-MB can detect the 
SNP set that shows a strong association with diseases 
with the fewest false positives. Furthermore, the heuris- 
tic search strategy in DASSO-MB can avoid the time- 
consuming training process as in SVMs and Random 
Forests. 

In this paper, we address the problems by introducing 
a Bayesian networks-based method, which also employs 
a Branch-and-Bound technique to detect epistatic inter- 
actions. Bayesian networks provide a succinct represen- 
tation of the joint probability distribution and 
conditional independence among a set of variables. In 
general, a structure learning methods for Bayesian net- 
works first defines a score reflecting the fitness between 
each possible structure and the observed data, and then 
searches for a structure with the maximum score. Com- 
paring to Markov Blanket based methods, the merits of 
applying Bayesian networks method to epistatic interac- 
tion detection includes: (1) BDE, BIC or MDL scores for 
Bayesian network structure learning can reflect higher- 
order interactions and are not sample-consuming; and 
(2) heuristic Bayesian network structure learning 
method can solve the classical XOR problem, which 
may hinder the applications of Markov blanket based 
approaches. 

We apply the bNEAT (Bayesian Networks based Epi- 
static Association sTudies) method to simulated datasets 
based on four disease models and a real dataset (the 
Age-related Macular Degeneration (AMD) dataset). We 
demonstrate that the proposed method outperforms 
Markov Blanket methods and other commonly-used 
methods, especially when the number of samples is 
small. 

Results 

Analysis of simulation data 

We first evaluate the proposed bNEAT method on 
simulated data sets, which are generated from three 
commonly used two-locus epistatic models in [15] and 
one three-locus epistatic model developed in [14]. 
Model-1 is a multiplicative model, model-2 demon- 
strates two-locus interaction multiplicative effects and 
model-3 specifies two-locus interaction threshold effects. 
There are three disease loci in model-4 [14]. Some cer- 
tain genotype combinations can increase disease risk in 
model-4 and there are almost no marginal effects for 
each disease locus. 

To compare the performance of different methods, we 
use the same data generation process and the similar 
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parameter settings as in [14,15,18]. We generate 50 
datasets and each contains 100 markers genotyped for 
1,000 cases and 1,000 controls. To measure the perfor- 
mance of each method, we use "power" as the criterion 
function. Power is calculated as follows: 



Power = — 
N 



(1) 



where N is the total number of simulated datasets and 
No is the number of simulated datasets in which all dis- 
ease associated markers are identified without any false 
positives. 

We compare the bNEAT algorithm with four meth- 
ods: BEAM, Support Vector Machine, MDR and 
DASSO-MB on the four simulated disease models. The 
BEAM software is downloaded from http://www.fas.har- 
vard.edu/~junliu/BEAM and we set the threshold of the 
B statistic as 0.1 [14]. For support vector machines, we 
use LIBSVM with a RBF kernel to detect gene-gene 
interactions and the detail is shown in [18]. Since MDR 
algorithm can not be applied to a large dataset directly, 
we first reduce the number of SNPs to 10 by ReliefF 
[19], a commonly-used feature selection algorithm, and 
then MDR performs an exhaustive search for a SNP set 
that can maximize cross-validation consistency and pre- 
diction accuracy. For DASSO-MB, we set the threshold 
of G test as 0.01 to determine (conditional) dependence 
and (conditional) independence. 

The results on the simulated data are shown in 
Figures 1 and 2. As can be seen, among the five meth- 
ods, the bNEAT algorithm performs the best. BEAM is 
worse than both bNEAT and DASSO-MB. One possible 
reason is that BEAM tries to detect single disease loci 
and epistatic interactions simultaneously. This strategy 
is unnecessary and makes BEAM over-complex. The 
other possible reason is that BEAM uses fixed Dirichlet 
priors in its Bayesian marker partition model, which 
may not reflect and penalize the model complexity 
appropriately [20]. 

Typically, GWAS can not generate a large number of 
samples due to the high experiment cost. Thus, the per- 
formance of various computational methods for epistatic 
interaction detection in case of small samples is impor- 
tant. We explore the effect of the number of samples on 
the performance of bNEAT, DASSO-MB, BEAM and 
SVM. We generate synthetic datasets containing 40 
markers genotyped for different number of cases and 
controls with = 1 and MAF=0.5. 

The results are shown in Figure 3. We find that 
bNEAT is more sample-efficient than other methods in 
that it can achieve the highest power when the number 
of samples is the same. In addition, it needs fewer sam- 
ples to reach the perfect power comparing to other 
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Figure 1 Performance comparison for = 0.7 The power is 
defined as the proportion of simulated datasets whose result only 
contains disease associated markers without any false positives. 



methods. DASSO-MB is the second best. For models 1- 
3, almost all methods can obtain a perfect power except 
SVM when the number of samples is larger than 4000. 
SVM can not achieve a perfect power even though we 
have sufficient samples (> 8000). This may indicate that 
the predictor-based methods lack the ability to find cau- 
sal elements precisely. The result from model-4 is parti- 
cularly interesting: bNEAT exhibits overwhelming 
superiority over other three methods, as bNEAT yields a 
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Figure 2 Performance comparison for r = 1 The power is 
defined as the proportion of simulated datasets whose result only 
contains disease associated markers without any false positives. 



perfect power even the number of samples is small 
(around 400), which indicates that bNEAT is especially 
suitable for detecting epistatic interactions with slight or 
no marginal effects. 

Results on AMD data 

In this section, we apply bNEAT to large-scale (large 
number of SNPs but small samples) datasets in real gen- 
ome-wide case-control studies, which often require 
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genotyping of 30,000-1,000,000 common SNPs. We 
make use of an Age-related Macular Degeneration 
(AMD) dataset containing 116,204 SNPs genotyped with 
96 cases and 50 controls [21]. Multiple genetic factors 
cause AMD, which can result in a loss of vision. 

To remove inconsistently genotyped SNPs, we per- 
form filtering process as in [18]. After filtering, there are 
97,327 SNPs remained. Since the number of SNPs is 
very large, restricting the search space to avoid unrea- 
sonable search by selecting some candidate SNPs as in 
[22] is necessary. We select top 200 candidate SNPs 
based on G test and then use bNEAT to identify 



Han and Chen BMC Genomics 201 1, 12(Suppl 2):S9 
http://www.biomedcentral.eom/1 471 -2 1 64/1 2/S2/S9 



Page 5 of 8 



disease SNPs related with AMD. bNEAT detects three 
associated SNPs: rs380390, rs3913094 and rsl0518433. 
The first SNP, rs380390, is already found in [21] with a 
significant association with AMD. Although no evi- 
dences were reported with the other two SNPs related 
to AMD in the literature, they may be plausible candi- 
date SNPs associated with AMD. 

Conclusions and discussion 

Comparing with many computational methods used for 
identification of epistatic interactions, Markov Blanket 
based method can increase power and reduce false posi- 
tives. However, Markov Blanket based method is sam- 
ple-consuming and the greedy searching strategy in 
Markov Blanket method is not suitable for detecting 
some interaction models with no independent main 
effects for each disease locus. In this paper, we propose 
a Bayesian networks method based on Branch-and- 
Bound technique (bNEAT) to detect epistatic interac- 
tions. We demonstrate that the proposed bNEAT 
method significantly outperforms Markov Blanket 
method and other commonly-used methods, especially 
when the number of samples is small. 

Even though the bNEAT method is more powerful 
than Markov Blanket based method, it can not be 
directly applied to genome-wide dataset due to the large 
number of SNPs. Integrating Markov chain Monte 
Carlo or simulated annealing technique into our bNEAT 
method to make it scalable to genome-wide dataset is 
one direction for future research. Moreover, we will 
explore different score schemes for epistatic interaction 
detection by Bayesian networks. For example, informa- 
tion-based score schemes (e.g., AIC score and BIC 
score) are derived in case of large number of samples 
[23]. When the number of samples is small, the approxi- 
mation in the inference of both AIC score and BIC 
score can not hold any more. In fact, the penalty term 
for model complexity in AIC score and BIC score can 
also reflect the variance of the model [24]. Thus in our 
future work, we will design a new score scheme by esti- 
mating the penalty term from data to make sure that 
the score scheme can fit data better. 

Methods 

Bayesian networks 

A Bayesian network is a directed acyclic graph (DAG) G 
consisting of nodes corresponding to a random variable 
set X = {Xi, X2, X„} and edges between nodes, which 
determine the structure of G and therefore the joint 
probability distribution of the whole network [25]. 

Definition 1 (Conditional Independence)for three 
random variables (nodes) X, Y and Z, if the probability 
distribution of X conditioned on both Y and Z is equal 



to the probability distribution of X conditioned only on 
Y, i.e., P{X \ Y, Z) = P(X \ Y), X is conditionally indepen- 
dent of Z given Y. 

This conditional independence is represented as . 
Similarly, represents conditional dependence [26]. 

Theorem 1 (Local Markov Assumption)£ac/z vari- 
able is conditionally independent of its nondescendants, 
given its parents in the DAG G. 

By applying the local Markov assumption, the joint 
probability distribution / can be represented as 

n 

P{X„...,X„) = Y{P{^APa{X.;)) (2) 

where Pa{Xi) denotes the set of parents of Xt in G . 
Therefore, there are two components in a Bayesian net- 
work. The first component is the DAG G reflecting the 
structure of the network. The second component, 6, 
describes the conditional probability distribution P{Xi \ 
Pa{Xi)) to specify the unique distribution / on G. 

Definition 2 (V-structure)for three nodes X, Y and Z 
in a Bayesian network, a structure with the form 
of X^Z9lY (no edge between X and Y) is called a 
v-structure. 

Definition 3 (D-seperation)f or three nodes X, Y and 
Z in a Bayesian network, if there is no active path 
between X and Y given Z, we say that X and Y are 
d-seperated given Z, denoted as Dsep(X;Y \ Z). 

Bayesian networks allow us to explore causal 
relationships to perform explanatory analysis and make 
predictions. As shown in Figure 4, GWAS attempts 
to identify the /c-way interaction among SNPs: SNPi, 
SNP2,..., SNPk, which are associated with a disease. The 
n SNP nodes and the disease status/label node construct 
a Bayesian network and we want to determine which 
SNP nodes are the parent nodes of the disease status 
/label node. 

Structure learning of Bayesian networks 

Even though a Bayesian network can be constructed by 
an expert, most tasks of determining the network struc- 
ture are too complex for humans. We have no choice 
but to learn the network structure and parameters from 
data. There are two types of structure learning methods 
for Bayesian networks: constraint-based methods and 
score-and-search methods. 

The constraint-based methods first build the skeleton 
of the network (undirected graph) by a set of depen- 
dence and independence relationships. Next constraint- 
based methods direct links in the undirected graph to 
construct a directed graph with d-separation properties 
corresponding to the dependence and independence 
determined [27-29]. Even though constraint-based 



Han and Chen BMC Genomics 201 1, 12(Suppl 2):S9 
http://www.biomedcentral.eom/1 471 -2 1 64/1 2/S2/S9 



Page 6 of 8 




SNPn 



Figure 4 An Example of Genome-wide Association Studies. The goal of genome-wide association studies is to identify the /c-way interaction 
among SNPs: SNPi, SNP2,...,SNPk, which are associated with disease. 



methods are developed with a rigorous theoretical foun- 
dation, errors in conditional dependence and indepen- 
dence will affect the stability of constraint-based 
methods and this problem is especially serious when the 
number of samples is small. 

The score-and-search methods view a Bayesian net- 
work as a statistical model and transform the structure 
learning of Bayesian network into a model selection pro- 
blem [30]. To select the best model, a score function is 
needed to indicate the fitness between a network and 
the data. Then the learning task is to find the network 
with the highest score. Thus, score-and-search methods 
typically consist of two components, (1) a score func- 
tion, and (2) a search procedure. In this paper, we focus 
on structure learning approaches for Bayesian networks 
based on score-and-search methods because score-and- 
search methods are more robust for small data sets than 
constraint-based methods. 

One of the most important issues in score-and-search 
methods is the selection of score function. A natural 
choice of score function is the likelihood function. How- 
ever, the maximum likelihood score often overfits the 
data because it does not reflect the model complexity. 
Therefore, a good score function for Bayesian networks' 
structure learning must have the capability of balancing 
between the fitness and the complexity of a selected 
structure. There are several existing score functions 
based on a variety of principles, such as the information 
theory and minimum description length (BIC score, AIC 
score, MDL score) [31-33] and Bayesian approach (BDe 
score) [34]. 

The general idea of BDe score is to compute the pos- 
terior probability distribution. Consider that we want to 
learn the structure 5 of a Bayesian network containing n 



nodes from a dataset D with N examples, and let qi 
denote the number of configurations of the parent set 
Pa{Xi) of Xi and let r,- represent the number of states of 
Xi, the BDe score is obtained as 



1=1 ;=1 



r(fl, + N,) 



n 



'I' fe=i 



(3) 



where A/,yvt is the number of cases for Xi in its !(th con- 



figuration and Pa{Xi) in the /'th configuration and 



N,, 



. aij and a^^ are user-determined Dirichlet 



priors which reflect a user's prior knowledge and we 
often set Uij^ = Nl{riqi). BDe score can penalize the 
structure complexity inherently by integrating 
P(D I 0s,S) and measuring the average expected likeli- 
hood over different possible choices of 0^ (9g is an 
estimate of parameters from the maximum likelihood 
method for the structure S) [35,36]. For AIC {Akaike 
information criterion) score and BIC [Bayesian infor- 
mation criterion) score, we can write a general score 
scheme as: 



Score[S \ D) = log P(D \e^,S)- C{S)f{N) 



(4) 



(5) 



where C{S) represents the structure complexity [32] . 
The first term of this score scheme measures the fitness 



Han and Chen BMC Genomics 2011, 12(Suppl 2):S9 
http://www.biomedcentral.com/1471-2164/12/S2/S9 



Page 7 of 8 



Algorithm bNEAT 

INPUT: Data D, Disease label node, all n 
SNP nodes 

OUTPUT: Disease SNP nodes, which has 
the maximum BIC socre on Disease label 
node 

Procedure [ 5; ]=bNEAT( ): Input: 
node set . Output: BIC score 5'i , parent 
set/;. 
Begin 

1. Compute BIC score tempS^ for , 
5'j = tempS^ , P\=V^ 

2. IF F|=null then i=0 else i=Fi (end) 

3. For i + \<q<n 
Begin 

(1) V2=V^'<Jq Compute BIC score 
tempSj for 

(2) IF tempS.^ > tempS^ then [S^ P^] = 
bNEAT (Fj) 

(3) IF52>5'i then5i = 52, P,^P^ 
End 

End 



between the structure, and data and the second term 
reflects structure complexity. With a maximum likeli- 
hood method, we can get 

n qt r,- 

i=l ;=1 k=l 

In (4), by setting /(A/) = 1, we get the AIC score as 
AIC{S I D) = log P{D\0s,S)- C[S) (7) 

If we set f{N) = l/21og(N), we get the BIC score, 
which is 

BIC[S I D) = log P[D \9s,S)-^C{S)log{N) (8) 

The BIC score are derived from a Taylor expansion 
and Laplace approximation when the number of sam- 
ples TV approaches °°. This results in a problem that the 
structure penalty term in (8) is very strict when the 
number of samples is small; therefore, we adjust the 
coefficient of the second term in (8) from 1/2 to a smal- 
ler number (in our applications, we empirically set it to 
be 0.17 for all the datasets we study). 

The computational task in score-and-search methods 
is to find a network structure with the highest score. 
The searching space consists of a superexponential 
number of structures-2°''' ) and thus exhaustively 
searching optimal structure from data for Bayesian net- 
works is NP-hard [37]. One simple heuristic search 
algorithm is greedy hill-climbing algorithm. In greedy 
hill-climbing algorithm, there are three types of opera- 
tors that change one edge at each step: 

♦ Add an edge 

♦ Remove an edge 

♦ Reverse an edge 

By these three operators, we can construct the local 
neighbourhood of current network. Then we select the 
network with the highest score in the local neighbour- 
hood to get the maximal gain. This process can be 
repeated until it reaches a local maximum. However, 
greedy hill-climbing algorithm cannot guarantee a global 
maximum [30]. Other structure learning methods for 
Bayesian networks include Branch-and-Bound (B&B) 
[38,39], genetic algorithms [40] and Markov chain 
Monte Carlo [41]. Branch-and-Bound algorithms guar- 
antee the optimal results in a significantly reduced 
search time compared to exhaustive search. Thus, we 
will employ B&B algorithms in our study. 

The proposed method uses B&B to search a structure 
that maximizes the BIC score. The algorithm is shown 
in Figure 5. bNEAT starts from an empty node set and 



Figure 5 the bNEAT algorithm 
^ J 



constructs a depth-first search tree to find the optimal 
parent (disease SNPs) set for the disease label node. In 
our B&B search, instead of using the pruning strategy as 
in [38,39], which sets a lower bound for the MDL score 
to prune the search tree, we stop the recursive calls 
when we observe that the BIC score will decrease on 
the children state of the current state. This strategy can- 
not guarantee global optima theoretically. However, it 
will significantly speed up the search process. 
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